Understand how PA can be useful for well-being interventions
Identify key steps and challenges of a PA project
Apply PA project lifecycle steps
Content
General Introduction
People Analytics
People Analytics project lifecycle
Case studies
Summary
Resources
General Introduction
About me
M.Sc. in Psychology @ University of Catania
Research internship in Psychological Methods @ University of Amsterdam
Ph.D. in Methodology and Statistics @ Tilburg University
People Data Scientist @ Rabobank
About Rabobank
People @ Rabobank
~45k employees (~30k in the Netherlands)
Different chapters within HR (~850 employees)
People Data and Innovation (~40 members)
Diverse expertise and cultural backgrounds
What is People Analytics?
Data Science definition
Data science is a “concept to unify statistics, data analysis, informatics, and their related methods” to “understand and analyze actual phenomena” with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge (Wikipedia, 2023)
Data Science
In a Nutshell..
Data Science = Make Data Useful
People Analytics (or People Data Science)
“The analysis of employee and workforce data to reveal insights and provide recommendations to improve business outcomes” (Ferrer and Green, 2021)
“The organizational function within which data collection, analyses, and translation occur as well as a set of practices that draw on employee data to inform and aid decision-making processes and employee activity throughout the organization” (Polzer, 2022)
People Analytics
In a Nutshell..
People Analytics = Make People’s Data Useful
Why People Analytics
People Analytics provides tools, methodologies, and techniques to extract meaning out of employee data and make this data useful to:
Interpret large volumes of data;
Identify trends and patterns in employee data;
Help to predict organizations’ and employees’ needs;
Prioritize HR activities based on impact, utility, and return on investment;
Reduce subjectivity and make decision-making transparent.
People Analytics Examples
Some interesting applications:
Employee retention
Enhancing employees’ well-being
Discovering occupational health risk factors
Reduce the pay gap among certain groups
Optimize recruitment and hiring
Learning and Development
Increase diversity
etc.
PA project lifecycle
flowchart
%%| fig-width: 10
A(Business Problem Discovery) --> B(Data Selection)
B --> C(Data Cleaning)
C --> D(Data Analysis)
D --> E(Interpretation and Storytelling)
E --> F(Implementation and Feedback)
F --> A
Business problem discovery
Business Problem Discovery
In this phase, there are a few primary goals:
Identify Problem
Define objectives and success metrics
Determine data sources and team
1) Identify problem
What is the problem?
Is it a problem? What’s the business value?
Who are the stakeholders involved?
What is the scope of the problem?
What is the problem time frame?
2) Define objectives
We normally try to answer these types of questions with People Analytics:
How much? (regression)
Which category? (classification)
Which group? (clustering)
Which option should be taken? (recommendation system)
2) Define success metrics
Specific
Measurable
Achievable
Relevant
Time-bound
3) Determine data sources and team
To ensure completeness of information and responsibilities, it is essential to:
Gather information about existing:
available data
reports and previous projects
documentation
Clarify roles and responsibilities
Example
The IT Department has been a critical part of the organization, driving innovation and ensuring systems run smoothly. However, we’ve noticed an alarming increase in stress-related sick leaves (~ 15%) among IT employees in the last year, which is higher than industry standards (7%). This not only affects their well-being but also disrupts our operations. To address this issue, we aim to develop a stress-risk classification system to identify high-stress cases early and implement targeted interventions. Our goal is to create a healthier work environment for our IT team.
Example
Identify problem:
Problem: Manage and reduce workplace stress in the IT Department
Problem size: Stress-related sick leaves have increased in the last year (+15%) and are higher than industry benchmarks. Also, there has been an increasing number of IT teams with numerous sick leaves.
Stakeholders: direct (IT dept. employees); indirect (managers, HR)
Scope: IT Department employees
Time frame: 1 year
Define objectives: Classify employees into stress risk categories and tailor interventions.
Success metrics: Decrease the number of high-stress cases by 8% within one year from the application of data-driven interventions.
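A success metric like the one above can be checked with simple arithmetic. The sketch below is only an illustration: the function names and the case counts (50 before, 45 after) are hypothetical and not part of the case description.

```python
def relative_change(before: int, after: int) -> float:
    """Return the relative change from `before` to `after` (-0.10 means a 10% decrease)."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (after - before) / before


def metric_met(before: int, after: int, target_decrease: float = 0.08) -> bool:
    """True when cases dropped by at least `target_decrease` (8% in the example)."""
    return relative_change(before, after) <= -target_decrease


# Hypothetical counts of high-stress cases, for illustration only.
baseline_cases = 50
cases_after_one_year = 45

change = relative_change(baseline_cases, cases_after_one_year)
print(f"Change: {change:.1%}")
print("Target met:", metric_met(baseline_cases, cases_after_one_year))
```

Fixing the baseline and the target before the intervention starts keeps the metric specific, measurable, and time-bound.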
Exercise
Select one of the business cases available on CANVAS and a group of students to work with.
Throughout this lecture, you and your group will work as People Analytics experts and try to go through each of the PA lifecycle steps.
Exercise
For this step, discuss with others in your group how to define the following aspects based on the case description:
Identify problem
Problem
Problem size
Stakeholders
Scope
Define objective
Hypothesize Success Metrics
Data Acquisition
flowchart
%%| fig-width: 10
A(Business Problem Discovery) --> B(Data Selection)
Data Acquisition
Data selection refers to collecting, retrieving, gathering, and sourcing data.
Data
Surveys
Pros: affordable, familiar, and (if well designed!!) very effective
Cons: bias-sensitive and could induce fatigue
Performance reviews and rating forms
Pros: efficient and (potentially) informative
Cons: bias-sensitive and (often) unreliable
Surveillance and monitoring
Pros: rich, objective behavioral measures
Cons: intrusive, costly to store
Organisational information
Pros: Cheap and easily collected
Cons: varying data quality
Text and Scraped data
Pros: Rich and new tools facilitate text extraction and analysis
Cons: Privacy sensitive and require heavy pre-processing
Summary data
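| Method | Privacy | Resource | Objectivity | Familiarity | Complexity |
|---|---|---|---|---|---|
| Surveys | + | = | = | + | = |
| Rating | = | = | = | = | = |
| Monitoring | = | = | + | = | + |
| DB Queries | = | = | + | = | + |
| Scraping | - | + | = | = | + |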
Legend: (+) High; (=) Middle; (-) Low
Exercise
Discuss, within your group, what type of data you would like to use for your project. The data can also be other than the types we just discussed. Elaborate on your choice and explain:
Advantages of using the (combination of) data you proposed
Disadvantages of using the (combination of) data you proposed
Selecting the data that we deem appropriate does not, in itself, guarantee good quality.
Data may be compromised before (or after) acquisition. Typical data quality issues are often due to the following:
Incompleteness: missing values or lacking certain attributes;
Noisiness: recording errors or outliers;
Inconsistencies: conflicting records or discrepancies.
Data pre-processing
To ensure good quality levels, it is essential to:
Conduct data health screens
Pre-process data
Data health screens
Data Health can be generally assessed by checking:
Record Count: Determine the total number of records in the dataset.
Variables Count: Identify the number of variables or features in the dataset.
Data Types: Check that the actual type of each attribute matches the expected one (e.g., nominal).
Missing Values: Count and assess missing values within the dataset.
Consistency: Examine data records for inconsistencies, such as verifying that values fall within specified ranges (e.g., 18 < age < 80)
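The checks above can be sketched in a few lines of code. This is a minimal illustration using only the Python standard library; the records, the variable names, and the `health_screen` helper are hypothetical, and the age range mirrors the example rule (18 < age < 80).

```python
from collections import Counter

# Hypothetical employee records, for illustration only; None marks a missing value.
records = [
    {"emp_id": 1, "age": 34, "department": "IT", "absences": 3},
    {"emp_id": 2, "age": 17, "department": "HR", "absences": None},
    {"emp_id": 3, "age": 52, "department": "IT", "absences": 12},
    {"emp_id": 4, "age": None, "department": "IT", "absences": 0},
]
expected_types = {"emp_id": int, "age": int, "department": str, "absences": int}


def health_screen(rows, expected, age_range=(18, 80)):
    """Return basic data health indicators for a list of dict records."""
    variables = sorted({key for row in rows for key in row})
    missing = Counter()
    type_mismatches = Counter()
    for row in rows:
        for var in variables:
            value = row.get(var)
            if value is None:
                missing[var] += 1  # missing values check
            elif not isinstance(value, expected[var]):
                type_mismatches[var] += 1  # expected vs. actual type check
    low, high = age_range
    # Consistency check: ages must fall strictly inside the allowed range.
    out_of_range = [
        row["emp_id"] for row in rows
        if row.get("age") is not None and not (low < row["age"] < high)
    ]
    return {
        "record_count": len(rows),
        "variable_count": len(variables),
        "missing": dict(missing),
        "type_mismatches": dict(type_mismatches),
        "age_out_of_range": out_of_range,
    }


report = health_screen(records, expected_types)
for check, result in report.items():
    print(f"{check}: {result}")
```

In practice, the range rules and expected types would come from the data documentation gathered during business problem discovery.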
Data pre-processing
Data pre-processing and cleaning are essential steps in preparing data for analysis. Typical steps taken during data pre-processing and cleaning include:
Handling Missing Values: Imputing or removing missing values.
Duplicate Detection: Identify and remove duplicate records.
Data Transformation: Convert data into a suitable format (e.g., standardization).
Outlier Detection: Identify and handle outliers.
Data Encoding: Encode categorical variables into numerical values to make them usable for analysis.
Data Discretization: Divide continuous variables into bins or categories for analysis.
Data Aggregation: Aggregate data to a higher level (e.g., monthly or yearly) for trend analysis.
Text Data Processing: Tokenize and preprocess text data.
Data Validation: Validate data against predefined business rules or HR policies.
Remember!! Keep detailed records of data pre-processing steps for transparency and reproducibility.
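A few of these steps can be sketched as follows. This is only an illustration under stated assumptions: the records are hypothetical, median imputation and z-score standardization are one possible choice among many, and the bin cut-offs (3 and 7 days) are arbitrary.

```python
from statistics import mean, median, pstdev

# Hypothetical raw records, for illustration only; None marks a missing value.
raw = [
    {"emp_id": 1, "department": "IT", "absences": 2},
    {"emp_id": 2, "department": "HR", "absences": None},
    {"emp_id": 2, "department": "HR", "absences": None},  # duplicate record
    {"emp_id": 3, "department": "IT", "absences": 10},
    {"emp_id": 4, "department": "Sales", "absences": 6},
]

# 1) Duplicate detection: keep the first record seen for each emp_id.
seen, rows = set(), []
for row in raw:
    if row["emp_id"] not in seen:
        seen.add(row["emp_id"])
        rows.append(dict(row))

# 2) Missing values: impute with the median of the observed values.
observed = [r["absences"] for r in rows if r["absences"] is not None]
fill_value = median(observed)
for r in rows:
    if r["absences"] is None:
        r["absences"] = fill_value

# 3) Transformation: standardize absences (z-scores, population SD).
mu = mean(r["absences"] for r in rows)
sigma = pstdev(r["absences"] for r in rows)
for r in rows:
    r["absences_z"] = (r["absences"] - mu) / sigma

# 4) Encoding: map each department to an integer code.
codes = {dept: i for i, dept in enumerate(sorted({r["department"] for r in rows}))}
for r in rows:
    r["department_code"] = codes[r["department"]]


# 5) Discretization: bin absences into three illustrative categories.
def absence_band(days):
    if days <= 3:
        return "low"
    if days <= 7:
        return "medium"
    return "high"


for r in rows:
    r["absence_band"] = absence_band(r["absences"])

for r in rows:
    print(r)
```

Writing the steps as a script, rather than editing the data by hand, is one simple way to keep the record of pre-processing decisions that the reminder above asks for.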
Data screening example
Inspecting and visualizing the data is a good first step to assess its properties.
| Employee_Name | EmpID | MarriedID | MaritalStatusID | GenderID | EmpStatusID | DeptID | PerfScoreID | FromDiversityJobFairID | Salary | Termd | PositionID | Position | State | Zip | DOB | Sex | MaritalDesc | CitizenDesc | HispanicLatino | RaceDesc | DateofHire | DateofTermination | TermReason | EmploymentStatus | Department | ManagerName | ManagerID | RecruitmentSource | PerformanceScore | EngagementSurvey | EmpSatisfaction | SpecialProjectsCount | LastPerformanceReview_Date | DaysLateLast30 | Absences |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adinolfi, Wilson K | 10026 | 0 | 0 | 1 | 1 | 5 | 4 | 0 | 62506 | 0 | 19 | Production Technician I | MA | 1960 | 07/10/83 | M | Single | US Citizen | No | White | 7/5/2011 |  | N/A-StillEmployed | Active | Production | Michael Albert | 22 | LinkedIn | Exceeds | 4.60 | 5 | 0 | 1/17/2019 | 0 | 1 |
| Ait Sidi, Karthikeyan | 10084 | 1 | 1 | 1 | 5 | 3 | 3 | 0 | 104437 | 1 | 27 | Sr. DBA | MA | 2148 | 05/05/75 | M | Married | US Citizen | No | White | 3/30/2015 | 6/16/2016 | career change | Voluntarily Terminated | IT/IS | Simon Roup | 4 | Indeed | Fully Meets | 4.96 | 3 | 6 | 2/24/2016 | 0 | 17 |
| Akinkuolie, Sarah | 10196 | 1 | 1 | 0 | 5 | 5 | 3 | 0 | 64955 | 1 | 20 | Production Technician II | MA | 1810 | 09/19/88 | F | Married | US Citizen | No | White | 7/5/2011 | 9/24/2012 | hours | Voluntarily Terminated | Production | Kissy Sullivan | 20 | LinkedIn | Fully Meets | 3.02 | 3 | 0 | 5/15/2012 | 0 | 3 |
| Alagbe,Trina | 10088 | 1 | 1 | 0 | 1 | 5 | 3 | 0 | 64991 | 0 | 19 | Production Technician I | MA | 1886 | 09/27/88 | F | Married | US Citizen | No | White | 1/7/2008 |  | N/A-StillEmployed | Active | Production | Elijiah Gray | 16 | Indeed | Fully Meets | 4.84 | 5 | 0 | 1/3/2019 | 0 | 15 |
| Anderson, Carol | 10069 | 0 | 2 | 0 | 5 | 5 | 3 | 0 | 50825 | 1 | 19 | Production Technician I | MA | 2169 | 09/08/89 | F | Divorced | US Citizen | No | White | 7/11/2011 | 9/6/2016 | return to school | Voluntarily Terminated | Production | Webster Butler | 39 | Google Search | Fully Meets | 5.00 | 4 | 0 | 2/1/2016 | 0 | 2 |
| Anderson, Linda | 10002 | 0 | 0 | 0 | 1 | 5 | 4 | 0 | 57568 | 0 | 19 | Production Technician I | MA | 1844 | 05/22/77 | F | Single | US Citizen | No | White | 1/9/2012 |  | N/A-StillEmployed | Active | Production | Amy Dunn | 11 | LinkedIn | Exceeds | 5.00 | 5 | 0 | 1/7/2019 | 0 | 15 |
Data screening example
Constructing descriptive statistics is also helpful to assess data properties, range, etc.
| Variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Employee_Name* | 1 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| EmpID* | 2 | 311 | 10156.000 | 89.922 | 10156.00 | 10156.000 | 115.643 | 10001.00 | 10311 | 310.00 | 0.000 | -1.212 | 5.099 |
| MarriedID* | 3 | 311 | 0.399 | 0.490 | 0.00 | 0.373 | 0.000 | 0.00 | 1 | 1.00 | 0.412 | -1.836 | 0.028 |
| MaritalStatusID* | 4 | 311 | 0.810 | 0.943 | 1.00 | 0.651 | 1.483 | 0.00 | 4 | 4.00 | 1.395 | 1.969 | 0.053 |
| GenderID* | 5 | 311 | 0.434 | 0.496 | 0.00 | 0.418 | 0.000 | 0.00 | 1 | 1.00 | 0.265 | -1.936 | 0.028 |
| EmpStatusID* | 6 | 311 | 2.392 | 1.794 | 1.00 | 2.241 | 0.000 | 1.00 | 5 | 4.00 | 0.626 | -1.494 | 0.102 |
| DeptID* | 7 | 311 | 4.611 | 1.083 | 5.00 | 4.723 | 0.000 | 1.00 | 6 | 5.00 | -1.522 | 2.153 | 0.061 |
| PerfScoreID* | 8 | 311 | 2.977 | 0.587 | 3.00 | 3.024 | 0.000 | 1.00 | 4 | 3.00 | -1.236 | 3.921 | 0.033 |
| FromDiversityJobFairID* | 9 | 311 | 0.093 | 0.291 | 0.00 | 0.000 | 0.000 | 0.00 | 1 | 1.00 | 2.784 | 5.770 | 0.017 |
| Salary* | 10 | 311 | 69020.685 | 25156.637 | 62810.00 | 64523.671 | 11834.113 | 45046.00 | 250000 | 204954.00 | 3.274 | 15.069 | 1426.502 |
| Termd* | 11 | 311 | 0.334 | 0.473 | 0.00 | 0.293 | 0.000 | 0.00 | 1 | 1.00 | 0.699 | -1.517 | 0.027 |
| PositionID* | 12 | 311 | 16.846 | 6.223 | 19.00 | 17.647 | 1.483 | 1.00 | 30 | 29.00 | -1.220 | 0.756 | 0.353 |
| Position* | 13 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| State* | 14 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| Zip* | 15 | 311 | 6555.482 | 16908.397 | 2132.00 | 2170.173 | 340.998 | 1013.00 | 98052 | 97039.00 | 4.066 | 15.788 | 958.787 |
| DOB* | 16 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| Sex* | 17 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| MaritalDesc* | 18 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| CitizenDesc* | 19 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| HispanicLatino* | 20 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| RaceDesc* | 21 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| DateofHire* | 22 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| DateofTermination* | 23 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| TermReason* | 24 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| EmploymentStatus* | 25 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| Department* | 26 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| ManagerName* | 27 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| ManagerID* | 28 | 303 | 14.571 | 8.078 | 15.00 | 14.251 | 5.930 | 1.00 | 39 | 38.00 | 0.752 | 1.532 | 0.464 |
| RecruitmentSource* | 29 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| PerformanceScore* | 30 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| EngagementSurvey* | 31 | 311 | 4.110 | 0.790 | 4.28 | 4.210 | 0.726 | 1.12 | 5 | 3.88 | -1.106 | 1.100 | 0.045 |
| EmpSatisfaction* | 32 | 311 | 3.891 | 0.909 | 4.00 | 3.916 | 1.483 | 1.00 | 5 | 4.00 | -0.220 | -0.784 | 0.052 |
| SpecialProjectsCount* | 33 | 311 | 1.219 | 2.349 | 0.00 | 0.711 | 0.000 | 0.00 | 8 | 8.00 | 1.524 | 0.589 | 0.133 |
| LastPerformanceReview_Date* | 34 | 311 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
| DaysLateLast30* | 35 | 311 | 0.415 | 1.295 | 0.00 | 0.012 | 0.000 | 0.00 | 6 | 6.00 | 3.113 | 8.595 | 0.073 |
| Absences* | 36 | 311 | 10.238 | 5.853 | 10.00 | 10.177 | 7.413 | 1.00 | 20 | 19.00 | 0.029 | -1.311 | 0.332 |
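The rows filled with NaN, NA, and Inf appear to belong to text or date variables (e.g., Position, State, DOB), for which numeric summaries are undefined. The sketch below reproduces a few of the statistics with the Python standard library only, using the Salary and State values of the six employees in the data preview; the helper name `describe` is hypothetical and the code is not the tool that produced the table.

```python
from math import sqrt
from statistics import mean, median, stdev


def describe(values):
    """Summarize numeric values; return None if any value is not numeric."""
    if not all(isinstance(v, (int, float)) for v in values):
        return None  # numeric summaries are undefined for text or dates
    n = len(values)
    sd = stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        "n": n,
        "mean": mean(values),
        "sd": sd,
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "se": sd / sqrt(n),  # standard error of the mean
    }


# Salary and State values of the six sample employees in the data preview.
salary = [62506, 104437, 64955, 64991, 50825, 57568]
state = ["MA", "MA", "MA", "MA", "MA", "MA"]

print(describe(salary))
print(describe(state))
```

Screening variable types before summarizing avoids uninformative rows and keeps the table focused on variables where a mean or range is meaningful.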
Data Visualization
Data visualization can also help to understand the data and gain quick insights.
Exercise
Based on the data you decided to use in your project, discuss within your group:
Potential data quality issues
Data health screens you would conduct
Data pre-processing steps
Visualization/Statistics you would investigate
Data Analysis
flowchart
%%| fig-width: 10
A(Business Problem Discovery) --> B(Data Selection)
B(Data Selection) --> C(Data Preparation)
C --> D(Data Analysis)
Data Analysis Goal
Analyze the (pre-processed) data to provide insights and recommendations, or to draw conclusions about the business problem.
Before deciding on what types of analyses would suit you best, it is essential to know the following:
What is the purpose of the analyses?
describe
understand
predict/classify
How interpretable should my model be?
Type of data
structured
unstructured
Mixed
none
What approach to choose?
Among some of the most commonly used tools in people analytics, we have:
Supervised learning: models for labeled data (i.e., outcome or dependent variable is available)
Unsupervised Learning: models for unlabeled data (i.e., outcome or dependent variable is unavailable)
Computational modeling and simulations: scenario analysis
Natural Language Processing (NLP): text analysis
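The difference between the first two approaches can be sketched on a toy example. Everything here is an assumption made for illustration: the overtime data and labels are hypothetical, the supervised model is a simple nearest-centroid classifier, and the unsupervised model is a 1-D k-means with two clusters.

```python
from statistics import mean

# Hypothetical data, for illustration only: weekly overtime hours per employee.
overtime = [1.0, 2.0, 3.0, 9.0, 10.0, 11.0]
labels = ["low", "low", "low", "high", "high", "high"]  # known outcome


def fit_centroids(x, y):
    """Supervised: learn one centroid (mean) per known label."""
    return {
        label: mean(v for v, lab in zip(x, y) if lab == label)
        for label in set(y)
    }


def predict(centroids, value):
    """Assign the label whose centroid is closest to `value`."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(x, iterations=10):
    """Unsupervised: split values into two clusters with 1-D k-means (k = 2)."""
    centers = [min(x), max(x)]  # simple deterministic initialization
    assignment = []
    for _ in range(iterations):
        # Assign each value to the nearest center.
        assignment = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in x
        ]
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [v for v, a in zip(x, assignment) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = mean(members)
    return assignment, centers


centroids = fit_centroids(overtime, labels)
print("Predicted label for 8 hours:", predict(centroids, 8.0))

clusters, centers = two_means(overtime)
print("Cluster assignment:", clusters)
print("Cluster centers:", centers)
```

The supervised model needs the `labels` column to learn from; the clustering step recovers groups from `overtime` alone, and it is then up to the analyst to interpret what each group means.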
People Analytics maturity model
People Analytics maturity model (actually)
Descriptive Analytics
Predict-ish Analytics
Predictive Analytics
Prescriptive Analytics
Exercise
Discuss within your group what type of analysis you would conduct for your case study. Specifically, focus on the following aspects:
How will your data analysis plan contribute to the project’s success?
Does it match the objective you defined earlier?
What kind of metric would you use in your analyses (e.g., mean, p-value), and why?
In what PA maturity stage (e.g., predictive analytics) would your proposed analysis fall?
Interpretation and Storytelling
flowchart
%%| fig-width: 10
A(Business Problem Discovery) --> B(Data Selection)
B(Data Selection) --> C(Data Preparation)
C --> D(Data Analysis)
D --> E(Interpretation and Storytelling)
Interpretation and storytelling
The pyramid principle
Data Visualization
The Pyramid Principle
The Pyramid Principle was developed at McKinsey & Company and subsequently published by Barbara Minto (Minto, 2008, 3rd edition).
This method helps to structure thinking and convince an audience. Ultimately, the aim is to significantly improve the impact of a project.
The pyramid principle
The Pyramid Principle is a communication strategy that emphasizes presenting information in a structured and impactful way through a hierarchical structure:
Main Message: Concise statement that answers the critical question
Main Arguments: Several independent arguments that support the main message
Supporting Evidence: Back up your arguments with relevant evidence
This approach simplifies complex information, making it easier for the audience to understand and engage with.
The Pyramid Principle
flowchart TD
O[Business problem] --> P[Situation, Complication, Question]
P --> E(Analyses)
E --> A
A[Main message] --> |why| B[Main Argument 1]
A --> |why| C[Main Argument 2]
A --> |how| D[Main Argument 3]
B --> b1[ev. 1]
B --> b2[ev. 2]
B --> b3[ev. 3]
C --> c1[ev. 4]
C --> c2[ev. 5]
C --> c3[ev. 6]
D --> d1[ev. 7]
D --> d2[ev. 8]
D --> d3[ev. 9]
A Well-being example
flowchart TD
O[Business problem] --> P[Reduce sick leave]
P --> E(Analyses)
E --> A
A[Promote Wellbeing] --> |why| B[Improve Employee health]
A --> |why| C[Enhances work-life balance]
A --> |how| D[Foster Positive work environment]
B --> b1[ev. 1]
B --> b2[ev. 2]
B --> b3[ev. 3]
C --> c1[ev. 4]
C --> c2[ev. 5]
C --> c3[ev. 6]
D --> d1[ev. 7]
D --> d2[ev. 8]
D --> d3[ev. 9]
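The same hierarchy can be written down as a small data structure, which makes it easy to check that every argument is backed by evidence before building the slides. This is a minimal sketch: the class names `Pyramid` and `Argument` are hypothetical, and the content is copied from the well-being diagram, including its placeholder evidence labels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    claim: str
    link: str  # "why" or "how", as on the edges of the diagram
    evidence: List[str] = field(default_factory=list)


@dataclass
class Pyramid:
    main_message: str
    arguments: List[Argument] = field(default_factory=list)

    def outline(self) -> str:
        """Render the pyramid top-down: message, then arguments, then evidence."""
        lines = [self.main_message]
        for arg in self.arguments:
            lines.append(f"  [{arg.link}] {arg.claim}")
            lines.extend(f"    - {item}" for item in arg.evidence)
        return "\n".join(lines)


# Content taken from the well-being diagram; evidence items are placeholders.
pyramid = Pyramid(
    main_message="Promote Wellbeing",
    arguments=[
        Argument("Improve Employee health", "why", ["ev. 1", "ev. 2", "ev. 3"]),
        Argument("Enhances work-life balance", "why", ["ev. 4", "ev. 5", "ev. 6"]),
        Argument("Foster Positive work environment", "how", ["ev. 7", "ev. 8", "ev. 9"]),
    ],
)
print(pyramid.outline())
```

Reading the outline from the top gives the audience the answer first, followed by the reasons and only then the supporting details.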
Data Visualization principles
Define your message
Understand analyses behind the message
Pick a suitable graph
Check the graph for clarity
formulate message
less is more
guide attention
Data Visualization charts
Implementation and feedback
flowchart
%%| fig-width: 10
A(Business Problem Discovery) --> B(Data Selection)
B(Data Selection) --> C(Data Preparation)
C --> D(Data Analysis)
D --> E(Interpretation and Storytelling)
E --> F(Implementation and Feedback)
F --> A
Implementation and feedback
Data and Results are only the starting point of a conversation
Align with stakeholders
Change management
Ethical and legal considerations
Summary
flowchart
%%| fig-width: 10
A(Business Problem Discovery) --> B(Data Selection)
B(Data Selection) --> C(Data Preparation)
C --> D(Data Analysis)
D --> E(Interpretation and Storytelling)
E --> F(Implementation and Feedback)
F --> A